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Abstract 

Coexpression of genes or, more generally, similarity in the expression 
profiles poses an unsurmountable obstacle to inferring the gene regula- 
tory network (GRN) based solely on data from DNA microarray time 
series. Clustering of genes with similar expression profiles allows for a 
course-grained view of the GRN and a probabilistic determination of the 
connectivity among the clusters. We present a model for the temporal 
evolution of a gene cluster network which takes into account interactions 
of gene products with genes and, through a non-constant degradation rate, 
with other gene products. The number of model parameters is reduced 
by using polynomial functions to interpolate temporal data points. In 
this manner, the task of parameter estimation is reduced to a system of 
linear algebraic equations, thus making the computation time shorter by 
orders of magnitude. To eliminate irrelevant networks, we test each GRN 
for stability with respect to parameter variations, and impose restrictions 
on its behavior near the steady state. We apply our model and methods 
to DNA microarray time series' data collected on Escherichia coU during 
glucose-lactose diauxie and infer the most probable cluster network for 
different phases of the experiment. 

Keywords: Gene regulatory networks, DNA microarray, gene product 
degradation, gene clustering, optimization, dynamic robustness 



1 Introduction 

The information encoded in the genome of living organisms has presented a new 
level of complexity that continues to challenge the mind. Both theoretical and 
experimental studies have proven to be very difficult mainly due to the large 
dimension of the system of interacting genes. Even the simplest of prokaryotic 
cells contain some 4000 genes of which a significant fraction participates directly 
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or indirectly in regulating (enhancing or inhibiting) the expression of one an- 
other. Despite these daunting obstacles, progress in unraveling Gene Regulatory 
Networks (GRN) has been made mainly due to new experimental methods that 
allow for more detailed studies of the intricate mechanisms within the living 
cells. 

In particular, owing to the development of DNA microarrays techniques, it 
has become possible to probe the behavior of thousands of genes simultaneously 
over a certain course of time. The advantage of recording the temporal evolution 
of the genome (or at least a large part of it) as compared to having the same 
information on only a handful of genes is obviously quite important and has 
led to the onset of new types of studies, e. g. see [T] and [2]. The challenge 
of inferring the GRN however still persists due to the abovementioned problem 
of large dimensionality, but also because of (still) partial and noisy data. This 
makes improvements in both theory and experiment equally important. 

The difficulty of dealing with a large number of genes has forced one to pro- 
ceed to various simplifications of the GRN problem. In many studies researchers 
have made use of simple models for the gene-gene product (GP) interactions in- 
cluding Boolean and Baysian networks [3], and linear coupled differential equa- 
tions as in the works by [1], [S] and [S]- Although simple models are attractive, 
especially when dealing with a large number of genes, they often suffer from 
lack of physical relevance. The biochemical interactions on the molecular level 
are known to be more complicated than most simple models can account for. 
Nevertheless, the correlations between genes that many simple models predict 
can provide a course-grained view of the GRN. Another popular trend in sim- 
plifying the GRN problem is to focus on a subset of genes that are known to 
regulate each other under the assumption that no other gene has a regulatory 
influence on this sub-network. This type of approximation makes more compli- 
cated models, i.e. non-linear models, feasible as was shown in [7] and [8]. 

An additional difficulty is related to the fact that many genes are coex- 
pressed and exhibit thus basically the same expression profiles, and that even 
non-coexpressed genes may have similar profiles under certain circumstances 
and/or during a certain time span. As a consequence one cannot obtain the 
information about the correlations among all genes from DNA microarray data 
alone, regardless of the quality or complexity of the model at hand. This issue 
can be partly resolved by grouping genes with similar expression profiles into 
clusters, which allows one to shift focus from individual genes to the study of 
how clusters influence each other. Consequently, the dimension of the problem 
is reduced to the number of clusters, which in many cases is much less than a 
hundred. Such a drastic reduction in dimensionality also opens the door to more 
complicated (and hence more accurate) models, and can reveal a more realistic 
course-grained picture of the GRN. The last difficulty we address here is the 
multitude of possible GRNs that generate the same gene (or cluster) expression 
profile. This is sometimes referred to as gene elasticity ^ . 

In this paper we attempt to further the methods of identifying the con- 
nections among gene clusters on the basis of DNA microarray time series and 
propose criteria that eliminate many possible solutions for the gene cluster net- 



2 



work. The system under consideration consists of E. coli bacteria in a glucose- 
lactose environment. The GRN model we design takes into account gene-GP 
and GP-GP interactions that are derived from physical arguments. Our work 
is a further step towards reliably predicting cluster gene networks on the basis 
of DNA microarray time series. 

The paper is structured as follows. In section one we give a brief description 
of the known regulatory mechanisms in prokaryotic cells and derive a phys- 
ical model that describes them. We then adapt this model to be suited for 
cluster-cluster interactions. In section two we discuss the procedure of param- 
eter identification and parameter reduction. The application of our model to 
E. coli during glucose-lactose diauxie is covered in section three where we also 
discuss the criteria for GRN selection. In the last section, we summarize our 
results and discuss possible issues as well as outlooks for further studies in this 
field. 

2 Modeling the biochemical processes 

The living cell is like a small factory whose products (GPs), i.e. RNA and pro- 
teins, sustain it, allow it to divide, and even terminate its life. The cell absorbs 
various chemicals from the environment and uses them for many purposes. Some 
of them serve as fuel to drive its internal machinery, others are used for intra-cell 
or cell-to-cell signal propagation and many other essential functions. Genes and 
their promoter sequences act as pieces of software that hold the instructions for 
synthesizing RNA molecules, some of which, i.e. the (messenger) mRNAs, are 
then translated into proteins; all RNAs and proteins are collectively referred to 
as GPs and we will make no distinction between them in what follows. 

The GPs that bind to regulatory promoter sites and hence regulate the 
transcription of genes (direct regulation) are called transcription factors (TP). 
Other GPs which are not TPs themselves but bind to TPs can also influence 
gene regulation (indirect regulation). In fact, virtually all GPs can play a role, 
either directly or indirectly, in the regulation of gene expression. Depending on 
the external environment some genes may be highly active (expressed) while 
others can have low output or even be completely off. Which gene is expressed 
and when depends on the abundance of specific GPs and their affinity to bind 
the gene's regulatory sites or other GPs. Since gene regulation is much more 
complicated in eukaryotes than in prokaryotes (see e.g. [10] ) we concentrate on 
the latter. Hence, the rest of this article deals exclusively with prokaryotes. 

2.1 A model of prokaryotic gene regulation 

In figure 1. we show a toy model of three genes that mutually interact through 
their GPs. The arrows indicate the gene-GP and GP-GP interaction pathways. 
Because of thermodynamic instabilities, degradation by enzymes, transport to 
other cell compartments, and effects of dilution upon cell growth, GPs inevitably 
degrade or loose activity in some characteristic time. Depending on this time 
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Figure 1: A GRN of three genes showing the regulatory pathways. The cir- 
cle, square, and pentagon represent the GPs synthesized by genes 1, 2, and 3, 
respectively (from left to right), and the broken circle, square, and pentagon 
represent degraded GPs. The + and — signs correspond to activation and re- 
pression respectively. The third gene synthesizes a GP that is not a TF as it does 
not directly regulate any of the other two genes. Also shown is the interaction 
among the different GPs. 



GPs can have either long or short lasting influence on genes. In what follows 
we present a mathematical description of how the GPs influence a gene's tran- 
scription rate. 

We begin by assuming that time delays between the production of GPs and 
their influence on a gene are negligible compared to the times over which the 
concentration levels change significantly. This assumption is supported by com- 
paring the typical diffusion and transcription rates, and amounts to neglecting 
the effects of translational regulation and any distinction between RNA and 
proteins. 

The most general set of differential equations describing a system of AI genes 
under constant environment has the form 

X, = i?,(X,C,) (1) 

where i = l,...,Af. i?,(X,Ci), X = {Xi,...,Xm) and Q = (Ci,...,Cf ) are an 
influence function, the concentration vector of GPs and the set of all parameters 
pertaining to gene i, respectively. The influence function consists of two terms: 

i?,(X,C,) = /.(X,Q)+5.(X,C,), (2) 

with 

/i(X,C,)>0, g,(X,C,)<0. (3) 

The first term in Eq. ^ depends on the abilities of various GPs to bind a 
promoter of gene i, i. e. the binding affinities, and on the efficiency of recruiting 
the RNA-polymerase, while the second term gives the rate of GP degradation. 
To understand the structure of fi, let us first look at a simple example of two 
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TF's competing for the same promoter. Suppose that the transcription of gene 
i is enhanced by two activators 1 and 2. The probabihty that say 1 wih bind to 
the promoter is (see [TT]) 

with Xi and X2 being the concentrations of GP 1 and 2. The index A stands 
for activation. The parameters An and Ai2 are proportional to the frequencies 
of colhsions between the gene's promoter and the GPs 1 and 2, and to their 
binding affinities. The term 1 in the denominator indicates that the promoter 
may be unoccupied, and the term ^^2^2 comes from the fact that the GPs 1 
and 2 compete for the same promoter. Indeed, if the concentration of GP 2 
becomes very large, the probability for GP 1 to bind the promoter becomes 
low. On the other hand, if the converse were to occur this probability would 
approach one. 

For a transcription to occur the part of the promoter which admits only 
activators must be occupied while the part that admits repressors must be 
unoccupied. The probability for such a scenario is given by the probability 
that an activator is bound to the promoter multiplied by the probability that 
the converse is true for a repressor. The probability for gene i to have a certain 
rate of transcription is then 

Pi(x,ro = Pa(x,a,)(i-P5(x,s,)), (5) 

where = (A„ S,), > 0, Sf > 0, 

X - 



and 
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The repeated index j sums over all the GPs which influence gene i: AjXj = 
AijXj. The total probability for gene i to be occupied by any activator is 
then 

P,(X, = : — . (8) 

il + AiX,)(l + SiX,) 

Note that in deriving this equation we made the assumption that the expression 
of a gene may be activated or repressed by a single GP, and does not require 
complexes of GPs or cascades of interacting GPs. The motivation for this choice 
is that most genes in prokaryotes are regulated by forming DNA-protein com- 
plexes involving single proteins. 

Since X is a stochastic variable we have to average the transcription rate 
over an ensemble of many cells under the same external conditions: 

/,(x,r,) = (r,p,(x,r,)). (9) 
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The parameter is the maximum transcription rate corresponding to the sat- 
uration point at A^Xj — >■ oo and is taken to be independent of the particular 
combination of GPs that bind to gene i. Note that by averaging over an en- 
semble, the stochastic variables of the system become determininstic. For that 
purpose we first write in Eq. ^ Xi = Xi + rji where Xi is the average expression 
of gene i over an ensemble of identical cells and rji is a Gaussian noise function 
of the same gene. We then expand with respect to rj: 



/,(x,r,) = r, 



P,(x,r,)-H^P.(x,r,)(r],>+0((?7'» 



(10) 



Since {rjj) = for a Gaussian function, the first order approximation of the rate 
function fi in Eq. (fTO|) yields 



/.(X,r,) = r,P,(x,r,) = r ^l^^ (11) 

Various environmental conditions within the cell can cause the GPs to de- 
grade, or loose their activity, after some characteristic time Tc [H]. Since the 
ability of a GP to infiuence a gene depends on how long it remains active, those 
GPs with a long are more likely to bind a gene's promoter. The converse 
is true for GPs with short t^. We hypothesize that GPs mutually interact to 
either prolong (e.g. through stabilizing complexes) or shorten {e.g. through 
degradation by proteases) their Tc in order to provide another channel for gene 
control. The nature of GP-GP interaction is too complicated for our purposes 
here and will not be treated on the molecular level. Instead we want to write 
down a course-grained expression that corresponds to the behavior expected 
from the arguments just outlined. In particular, we expect that overabundance 
of any one GP would saturate its influence on other GPs. On the basis of this 
assumption, we define a general X-dependent degradation rate ki of the form: 

l^fc.(X,^.)= , (12) 

where ii — {K^ , jK-i), Kf > 0, — oo < Kf < oo, and n is the characteristic 
time associated with GP i. The two parameters and symbolize the 
maximum and minimum degradation rate respectively. The matrix elements K-j^ 
give the influence of GP j on GP i. Notice that when Xj is large and positive, 
fci(X, £i) approaches which corresponds to the longest for GP i while the 
opposite limit yields - the shortest t^. The sign of each matrix element 
Ki^ determines whether GP j has a stabilizing (a plus sign) or destabilizing (a 
minus sign) influence on GP i. As before, Eq. (jl2p must be averaged over the 
ensemble of cells. The degradation term gi of Eq. ([2]) can now be written as 

g,{X,£,)^-h{X,£i)X,. (13) 

Making the same substitution as before, Xi = Xi + rji, and expanding to the 
second order on 77 leads, after ensemble averaging, to the following flrst order 
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approximation: 



K+ + Kre^'''^ 

1 j_ ^K'.Xi 



(14) 



Combining Eqs. ([TT|) and (HH), the influence function in Eq. ([T]) becomes 



1 + e 



~Xi. 



(15) 



2.2 A model of gene cluster regulation 

Given that genes with similar expression profiles cannot be differentiated on the 
basis of data from microarray time series, we are led to group genes into clusters 
according to the similarity of their profiles. This compels us to rewrite our model 
in terms of gene clusters and treat the deviations in concentration levels from 
the cluster average as external perturbations. Let us write Xi = Xq + ^i, where 
is the deviation of gene i from the average concentration Xq of cluster q. This 
transforms all terms of the form A^Xj into A^Xp + AjS^j, where Af — J^j Al is 
the sum over all genes in cluster p denoted Np, and the sum over p goes over 
all clusters Nc- Inserting this into Eq. ([T|), expanding the right hand side up 
to first order on ^, and averaging over cluster q and over the ensemble of cells 
yields 



Xq = ipq{X)+Sq, 
^,(X) = FqiX) - Kq{1L)Xq, 



where X= {Xi,...X 



(16) 
(17) 



1 



i=l 



(18) 



and 



1 



1 

Kqm =—yk,{x,£,), 



N, 



9 ,-1 



Nq^ 



X 



'90 



(19) 



(20) 



In what follows, we only consider the zeroth order approximation: Sq = 0. 

To complete our journey from genes to clusters we must reformulate the 
functions under the summation in Eqs. ^TE\\ . ([TS]). The concentration levels 
Xp already represent the cluster average, but the parameters r^, A^, Sf, K^, 
and Kf need to be replaced with some effective parameters pq, a^, (3^, 
K~ , and kP, respectively. Even though doing this alters the local behav- 
ior of the multivariable functions defined in Eqs. (fT8|) and ([T9l) . their general 
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behavior will remain the same given an appropriate set of effective parameters. 
In other words, if one takes several functions that follow a certain behavior, 
i.e. starting linearly near zero and saturating for large values, their average will 
produce a function with the same behavior. A downside to this reformulation 
is a loss of oversight of the connection between the original set of parameters 
(r„ AP, SP, K+, K-,KP) and the effective set a = (p,, a^, pp, k+, k'^kP). The 
above reasoning thus yields the following model: 

aPXp 

(1 + a'qXp){l + p^Xp) 

K,{X) ^ \ (22) 

This model contains a large number of parameters compared to the number 
of data points, which raises the issue of overfitting. In the following sections we 
address this problem and demonstrate how a lower bound on the total number 
of parameters necessary to fit the data can be determined through parameter 
reduction. 



3 Parameter identification 



Having formulated a dynamical model of the cluster network we move on to 
identifying the unknown parameters a. A standard practice is to define a dis- 
tance function 



d(a) 



1 



EE 



dt 



A,(t)-(^,(X,a) 



1/2 



(23) 



where Nt is the number of time points provided by the data, and Nc represents 
the number of clusters. In order to estimate the derivatives dXg(t)/dt one must 
choose an interpolating function fitting the data points (discussed further in the 
next section). The function iy9g(X,a) defines the model; it is given by eq. pT)) . 
Minimizing c?(a) with respect to all the parameter sets a gives the model that 
fits the data the best. 

The minimization problem depends heavily on the total number of param- 
eters to be optimized and so, reducing the number of parameters is a valuable 
endeavor. We can start by noticing that if the parameter sets a^ for each cluster 
q are independent of each other one can minimize the parameter function 



1 



■ Nt 

E 



Xq{t) - ipq(X.,a.g 



1/2 



(24) 



with respect to the parameters a^ for each cluster separately. In this manner, 
the problem is reduced from optimizing 3Nc{Nc -1-1) parameters in one step to 
optimizing 3(A'^c + 1) parameters Nc times. 
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3.1 Parameter reduction 



We now present a useful method for reducing the number of parameters of 
specific types of models, namely those of the form Lpg{'X.,ujPXp,h'q), where at 
least some of the parameters, i. e. w^s, range from — cxi to +00. The remaining 
parameters, i^q, are positive. 

The temporal profile of each cluster consists of discrete time points. In order 
to determine the functions dXq/dt one must interpolate the data points with a 
smooth continuous curve. We chose for that purpose polynomial functions of 
order w. 

w 

^g = ^c,„r. (25) 

n=0 

Inserting Eq. (1^ into (pq{X.,ujPXp,Vq) gives 

iPqCK, LoPqXp, Uq) = ipq{Cqnt'^ , W^Cpnf, Vq) (26) 

where we again sum over repeated indices. We can now define the new matrix 
element i?g„ = ^qCpn which allows us to rewrite Eq. (|24l) as 



dqiBq, Vq) = 

1/2 

(27) 



1 



^ [Cqn-nf'-'^ - ipq{Cqnt'^,Bqnt'^,Vq)]' 
.t=l 



The number of parameters to be determined for each cluster now equals w + 1 
plus the number of elements in the set Vq. This simple procedure reduces the 
number of parameters and provides a degree of control when interpolating the 
data points. For instance, one may want to use low order polynomials at the 
expense of good data fit in order to reduce the number of parameters and thus 
avoid the problem of overfitting. 

After determining the new parameters Bqn by minimizing Eq. (j27p we can 
solve for the original matrix elements ujP. Notice however, that if w is less 
than the dimension of matrix we end up with many different solutions for 
the parameter set (a;*°, w^™) depending on what integers we assign to the 
indices For instance, if w — 4 then we may solve for the parameter 

sets {ujq,ujq,ujq,ujq,ujq) 01 {ujq , ujq , ujq , ujq^ , ojq'^) , and SO ou. Rcgardkss of which 
parameter set we chose to solve for, the fit initially determined by minimizing 
d will be the same. The number of parameter sets, and thus the number of 
solutions for each cluster, is Ncl/[{w + l)\{Nc — w— 1)]!. Hence the total number 
of ways the clusters can be connected to give the same value of d is [Ncl/[{w + 
l)\{N,-w-l)T^. 

The advantage of this method is twofold: firstly, it allows one to obtain all 
solutions for the parameters ui^ which yield the same value of d; and secondly, to 
obtain these solutions, one only has to solve a set of linear algebraic equations for 
each set {ujq° , ...,ujq™), thus significantly reducing the computation time. In the 
next section we will discuss criteria for selecting solutions most likely adopted 
by nature. 
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4 Modeling the glucose-lactose diauxie in E. coli 



We chose to model the gene expression profile of E. coli during glucose-lactose 
diauxie. The DNA microarray data was collected by Traxler et al H^. The 
diauxie experiment is designed to observe the response of an organism to envi- 
ronmental stress, i.e. starvation. In the case at hand, the E. coli colony was 
exposed to a mixture of two sugars, glucose and lactose. The initial reaction 
of the colony was to feed exclusively on glucose while steadily growing in size. 
Once glucose was exhausted the growth came to a halt for a certain amount of 
time after which it was resumed due to the onset of lactose consumption. While 
the exact mechanism of this metabolic switch is not known it has been hypoth- 
esized that the gene network of the organism becomes rewired in response to 
the changing environment (decrease in glucose) in order to survive. We want 
to model this metabolic transition and study how the cluster network changes 
with the varying conditions of the environment. 

The model we presented in the earlier section does not include any explicit 
time dependence due to environmental changes. We introduce this feature into 
our model according to the following observations and the conceptual model 
of glucose- lactose diauxie presented in (13| . The growth arrest happens very 
abruptly and therefore should not be linearly proportional to the depletion rate 
of the glucose. Rather, the sudden drop in the growth rate should be the result 
of the glucose level crossing a certain threshold below which a new GRN becomes 
active. The primary function of the new GRN should be to rapidly decrease cell 
growth while continuing to feed on glucose. During the time of growth arrest 
(mixed phase) the system makes a smooth crossover from the glucose to lactose 
phase in which the cell growth is resumed again. The system thus ends up with 
the GRN that is most suited for the consumption of lactose and cell growth. 
The processes just described can be represented in symbols as 

+ [1 - hP\t,n^)\\\ - h\t)]^la\%) + h\t)^[{X), (28) 

where g and I stand for glucose and lactose, respectively. The glucose and 
lactose phases are described by the models (p^ and 95' , respectively, whereas the 
mixed phase is a superposition of two models defined by ip\^'' and V'l^'- The 
functions allowing for these transitions are taken to have sigmoidal shapes: 



1 + {t/Tg)--^ 



Here the exponents rrig and m; are positive numbers that determine how abruptly 
the system transits from the glucose to the mixed phase and from the mixed to 
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the lactose phase, respectively. The constants Tg and r/ give the points in time 
of the respective transitions. In the mixed phase, the system makes a transition 
from one network to another characterized by the exponent rig and the time 
constant Tgi which we consider to be half way through the mixed phase. The 
functions in Eq. ([^^ can be thought of as average fractions of cells with a 
particular GRN. Although they have not been derived from experimental ob- 
servation, using sigmoidal functions is a standard practice in studying biological 
transitions. 

4.1 Data analysis and gene clustering 

Until this point we have been considering protein concentration levels as the 
quantity that is available from experimental data. However, the DNA microar- 
ray experiments detect the presence of mRNA molecules - the precursors of 
proteins. The pathway from mRNA to proteins occurs very quickly in prokary- 
otes and so, the concentrations of these two quantities have an approximately 
linear relationship [M] [2] . Hence the data on the levels of mRNA can be iden- 
tified with data on protein levels. 

The DNA microarray experiments as they are currently performed do not 
measure the absolute mRNA concentration directly. What they measure is 
the intensity of light emitted by the mRNAs after they are illuminated by a 
laser. The intensity / is approximately proportional to the absolute mRNA 
concentration X . The actual relationship between / and X follows a sigmoidal 
curve of the form / = aX/{l + bX) where a and b are probe specific parameters 
[15] . For simplicity we assume that the linear approximation / = aX is sufficient 
for our purposes. The DNA microarray data are usually presented in the form 



with loi being some constant background intensity or the intensity at a given 
time point in a well-defined environment. The index i refers to a particular 
gene. Solving this expression for Xi gives 



As we discussed earlier, a standard practice in microarray data analysis 
is gene clustering, to cope with the indistinguishability of groups of gene pro- 
files. Although in principle one may choose to cluster the mRNA concentrations 
ttiXi/Ioi, it is more relevant to cluster the Zi^s. The main reason for this is that 
the standard deviation of mRNA concentrations measured by DNA microarray 
techniques, due to noise and systematic experimental errors, has been shown to 
grow linearly with the expression level when this level exceeds some threshold. 
Taking the logarithm makes these errors additive rather than multiplicative [TB] . 
The clustering is thus less sensitive to the large errors on large concentrations 
when applied to the logarithms of the concentrations, and thus to the Z^'s. 

The genes are clustered on the basis of the similarity of their temporal expres- 
sion profiles. We use for that purpose an ordinary tree-like clustering algorithm. 



Zi = log2(/j//oi) = log2{a,Xi/Iot) 



(30) 




(31) 
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which starts by considering each gene as forming a class on its own and then 
groups classes two by two. In each step, the two classes are merged for which 
the average distance between ah pairs of gene profiles, Zi{t) and Zj{t), taken in 
either of the two classes, is minimum. The distance between the gene profiles is 
defined as 



Nt 



1/2 

(32) 



The procedure stops when the average distance < Dij > in the newly created 
class exceeds a certain threshold. We chose this threshold to be 0.45 which leads 
to 12 clusters, each represented by the average profile Zq{t). This clustering 
method could be modified by e. g. adding a shift or introducing a scaling factor 
in the distance function Dij, however, for our purposes the simplest one suffices. 
One could also choose different thresholds, but we limited ourselves here to a 
threshold giving a sufficiently low number of clusters while keeping the profiles 
in each cluster reasonably similar. 

The variation of each profile around the mean is defined as Zi = Zq -\- Zi. 
Making an expansion on Zi, Eq. (1311) then becomes 

J^^e^^''^^{l + zjn2). (33) 

Inserting this expression into our model, Eq. (|16p . simply redefines all model pa- 
rameters as C°-''/(a/oj) = C"j^^ and the deviation function as = e'^' '"^Zi In 2. 

The DNA microarray data is inflicted with random noise which makes the 
temporal evolution of concentration levels seem more disjunct than it actually 
is. In order to alleviate this problem we apply a simple filtering procedure to 
each cluster. We define the cluster average as a linear combination of Zq{tn) at 
the nth time point and the two neighboring points n~ I and n + 1. In symbols: 

Zq{tn) ^ Zq{tn), n^l,Nt (34) 

Zq{tn) 2Zq{tn) + -Zg(t„_i) + -Zq(t„+i), 

l<n< Nt 

While other filtering methods exist, this one is the simplest and has been suc- 
cessfully used before [4]. 



4.2 Criteria for network selection 

We used the global minimization algorithms on Mathematica to minimize the 
distance functions defined in Eq. (IM)) for each cluster. Since only a few time 
points of the data set belong to the glucose phase, our modeling procedure 
cannot be reliably applied in this temporal region without the risk of overfitting. 
For this reason, we begin our analysis with the glucose-lactose transition phase, 
which we estimate to start at the third time point, and continue with the lactose 
phase passed the time point number eight until the last (seventeenth) time point. 
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Hence, the number of time points in each respective phase is: — 3, iVf' = 6, 
and A''( = 10. These time points span a few hours. 

As mentioned before, the number of possible networks which give the same fit 
is very large. However, a good fit does not guarantee that the temporal evolution 
of the system will be stable with respect to the small deviations Aq{t) = Xq{t) — 
Xq{t), where Xq{t) is the modeled curve and Xq{t) the interpolated data curve. 
Since our model contains terms such as C^Xp, one can see that a deviation from 
the interpolated curve Xq{t) = Xq{t) + A,(t) wiU lead to C^Xp + Cp\. Unless 
C^Ap is small it will cause the system to deviate more and more after each 
iteration of the differential equation solving algorithm. We therefore argue that 
small parameters are likely to lead to greater stability than large parameters. 
While we do not give a formal proof here we report that running simulations 
with different sets of parameters do support this argument. Although large 
parameter values make the system unstable, the opposite cannot always be said 
of small parameters. Once we select the solution set with the smallest parameter 
values we must weed out the ones that fit the data points poorly. This can be 
done by computing the quantity 

1/2 

(35) 

and keeping the parameter sets which give the lowest ft. 

Another restriction we impose on the possible solutions is that the system 
must settle in a fixed point after some relaxation time in the absence of external 
perturbations. We argue that the fixed point should be of the same order 
of magnitude as the average vector (X) = (l/-^t) X^n -^(*")- hase this 

assumption on the observation that even the most abrupt changes during the 
diauxie experiment lead to the log intensity levels no larger than |Z| « 2. It is 
therefore reasonable to suppose that fixed points which differ by more than one 
order of magnitude from (X) are not biologically meaningful. We quantify this 
criterion by defining the scalar quantity 

X=|(X)-X(i^too)|, (36) 

where too was chosen to be three times the difference between the first and the 
last time point. 

Random mutations in the genes and GPs can be beneficial to biological sys- 
tems; however, in many cases they degrade their performance and even become 
lethal. Other random variations such as temperature, pH factor, diet change, 
etc. can also hinder the phenotype of a biological system. All of these changes 
translate into the alteration of some network connections, i.e. parameters a^, 
pp, and K^. However, survival of biological systems partly relies on the fact 
that their parameters are not rigid but can vary within a certain range (see 
Gutenkunst et al [17j). A system which is robust with respect to perturbations 
of the network connections is therefore well suited for survival (for more detailed 



1 ^* 

n—1 



X(t„)-X(t„)|' 
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discussion of robustness, see [H]). We define a parameter robustness function 

n— 1 2—1 s— 1 \ / 

where the dummy variable s runs over all model parameters and Cg stands 
for a particular parameter. The partial derivative compares a system with 
one of its variables perturbed by a small amount to the unperturbed system. 
Those networks which gave the smallest value of /i were preferentially selected 
as possible candidates over the others. We should mention that small values of 
fj, can have two implications. Either the system is very sensitive to only a few 
parameter changes, or it is mildly sensitive to many parameter perturbations. 
The former would imply that certain parameter values must be preserved at all 
cost in order for the system to function properly, while the latter necessitates 
that a large alteration in one, or several, of the parameters must occur for a 
significant phenotypic change. Irrespective of which one of these scenarios takes 
place, a system with the lowest fi is said to be most robust. 



(37) 



5 Results and Discussion 

The abundance of the information obtained from DNA microarrays scales with 
the number of time points. Given the scarcity of data points in the glucose 
phase and the fact that the concentration levels are nearly constant (probably 
because the system has reached a steady state or fixed point), we cannot obtain 
reliable information about the gene network in this temporal region. Therefore, 
we start at the third time point which marks the beginning of the growth arrest. 
The last three points approach another plateau due to depletion of the lactose. 
Since we do not introduce this feature to our model we stop at the fourteenth 
point. 

Following the procedures of parameter reduction and parameter identifica- 
tion detailed in the preceding sections we found that the mixed phase requires a 
polynomial of order eight to have a good interpolation between the data points 
(see Eq. ((25|)). In the lactose phase the interpolating polynomial turned out to 
be of order four. The global optimization algorithms give less reliable results 
as the number of parameters grows. For this reason, we separated the problem 
into two parts. First, we considered Eq. (|22|) to be independent of the k^'s, i.e. 
Kq (X) = /t+ + K~ , which leaves Eq. (|21|) as the source of transcription control. 
We then minimized the distance function Eq. ([24| with respect to pq, k^, 
and /3^, and recorded how well it fitted the data. Second, we set Eq. (|2T|) to a 
constant, i.e. i^q(X) = pg, and optimized the distance function dq with respect 
to pq, At^, and the Bp^s. 

The application of this approach to the lactose phase showed that imposing 
the latter assumption (i^g(X) = constant) allowed us to fit the data orders 
of magnitude better than when imposing the former assumption (KqCX.) = 
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Figure 2: Temporal expression profile of cluster 9. The vertical axis is propor- 
tional to the absolute concentration levels X, while the horizontal axis represents 
the time in hours. The dots and the solid line represent the data points filtered 
according to Eq. (|34|) and the modeled curve, respectively. The transition from 
mixed to lactose phase occurs at time 2.4 hours, a) Temporal evolution dur- 
ing the diauxie. b) Extrapolation of the expression profile well outside of the 
experimental data showing the expression level approaching a fixed point. 



constant) for clusters 1 through 10. For clusters 11 and 12 we had to include 
all parameters contained in our model and found that the only nonzero /3^'s are 
those with the index p having values p = 4, 8, 12 and p = 11 for cluster 11 and 
12 respectively. 

Application of the procedure just outlined to the mixed phase yielded similar 
results, namely, that keeping Fg(X), rather than i^g(X), constant for all clusters 
gives a much better fit of the data. However, in the mixed phase, minimization 
with respect to the parameters Bqn of Eq. ([27l) yielded k^'s that were very 
large. Due to this complication we resorted to the conventional way of parameter 
identification, Eq. ()24|) . and optimized dq with respect to the original parameters 
K^'s. The latter gave good results while keeping the k^'s small. 

These results suggest that the effects of GP-GP interaction, as described by 
the second term of Eq. (1211) , are absolutely necessary in gene regulation during 
the glucose-lactose diauxie. They also imply that the rate of transcription, 
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corresponding to the first term of Eq. (PT|) . is relatively constant indicating 
that the GPs which participate in gene activation are abundant while the ones 
that inhibit transcription are low in concentration. Another observation one 
can make is that the GRN is completely connected in the mixed phase and 
becomes more sparse in the lactose phase. This means that in the mixed phase 
there is no room for different parameter sets - only one network accomplishes 
the temporal profiles given by the data. 

In contrast, the number of possible parameter sets in the lactose phase is very 
large. In order to pick out the most probable network in the lactose phase we 
employed the network selection criteria described in the previous section. First, 
we selected for each cluster the five parameter sets with the smallest parameter 
values and then ran simulations for 350 randomly chosen combinations among 
the clusters (refer to section 3A) . For each combination we computed the three 
quantities fi, x ^md /x (Eqs. ([551) . ([57)) ). which monitor the goodness of 

fit, the approach to a fixed point and the robustness, respectively. We ended up 
with only 10 combinations that yielded small values for all three criteria. Note 
that the fixed point criterion x showed a discontinuity in the possible values it 
could take centering around the numbers w 1 or w 15. Figure 4 shows a three 
dimensional plot representing $7, x, and /i. One can see that the concentration 
of points nearest to zero is relatively low. The isolated group of 10 points within 
the circle comprises the best candidates for the GRN in the lactose phase. For 
a particular GRN in the lactose phase, we exhibit in Figs 3 a and 3 b the data 
fit of cluster 9 between the time points 1 and 14, and the extrapolated curve 
showing the fixed point, respectively. For temporal profiles of the other clusters 
refer to Fig 1. and Fig 2. of the supplementary material. 

Although the "true" GRN cannot be determined with certainty, one can 
hope to at least identify the connections that are indispensable. By comparing 
different possible GRNs we can assign more importance to the connections that 
appear most often. We define the average connectivity: 

H)-Jill<i (38) 

n 

where M = 10 is the number of GRNs considered and is the connection 
between clusters p and q (see Eq. p2|) ) given by the nth GRN. The associated 
standard deviation is given by: 

1/2 

(39) 

between clusters q and p. If a connection (Kp has a large value its contribution 
to gene regulation is significant. However, if dgp is also large, i.e. <jqp ^ (k^), the 
certainty of this connection's value is low and one cannot consider its significance 
with confidence. Another important common factor is the uniformity of the sign 
for each connection. If a connection has a positive sign in one solution, it should 
have the same sign in all the other solutions. 



M 
n=l 
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Figure 3: Plot ol il, Xj and ^. The points within the circle correspond to the 
best candidates for the "true" GRN. 



To have an objective measure of how important a connection is, on the basis 
of its strength, standard deviation and sign, we define a significance factor which 
ranges from to 1: 

5,, = Ji^(l-e-IK)l/'^..)e-^±, (40) 

I \^max/ I 

where KKma^)! is the value of the largest average connection and N± is the 
number of times a connection changes sign. In figure 4 we showed the gene 
network in the lactose phase, with only the significant connections indicated, as 
defined by Sqp > 0.1. 



5.1 Concluding remarks and outlook 

We have presented a detailed analysis of the problem of GRN inference through 
the design of a model which captures the biochemical effects between genes and 
CPs as well as the interaction among the GPs themselves. We hypothesized 
that the most important role of the GP-GP interaction is to vary (increase or 
decrease) the characteristic time during which a GP can perform its function. 
The agreement between data and simulation based on our model suggests that 
the role of the interaction among GPs is essential in GRNs. To our surprise, 
the regulation of genes by direct binding of GPs to the genes' promoters, as 
described by Eq. (fTTj) . amounted to a constant independent of time in all except 
two clusters, 11 and 12, in the lactose phase. Although one may be tempted 
to conclude from this result that the transcription rates of nearly all genes are 
constant in both phases, leaving the non-constant degradation rate in charge 
of the gene regulation, it should be kept in mind that Eqs ([2T|) and (|22]) deal 
with gene clusters, not individual genes. The transcription rates of all genes in 
a particular cluster may exhibit temporal variations while yielding a constant 
value when averaged over the cluster. Therefore, our results must be interpreted 
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Figure 4: A plausible gene cluster network in the lactose phase based on our 
criteria. The full lines correspond to the k^'s of Eq. ((22)) while the dashed lines 
represent the a^s and /3^s in Eq. ([2T|) . 



in the context of cluster network and cannot be directly compared to data on 
networks containing individual genes. 

The cluster network in the lactose phase is very sparse compared to that in 
the mixed phase. Previous works on the dynamic robustness of GRNs suggests 
that biological networks with low connectivity are better suited for survival than 
more densely connected networks [19]. Our results suggest that under external 
stress, e.g. starvation, the GRN of E. coli becomes highly connected in order to 
adapt to the suboptimal conditions. This implies that while in the mixed phase, 
E. coli is more vulnerable to random external perturbations, upon transition to 
the lactose phase the robustness with respect to environmental insults becomes 
restored. 

The complete connectivity of the mixed phase can also be taken to mean 
that upon depletion of glucose the different cells try different GRNs, each of 
which is possibly sparse [2^ . Under this assumption, the DNA microarray data 
would correspond to a superimposition of different GRNs experimented by the 
system until it finds the right GRN, which allows it to feed on lactose. More 
experimental and theoretical work will be needed to settle this issue. 
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